Improving Continuous Sign Language Recognition with Cross-Lingual Signs
This work is dedicated to continuous sign language recognition (CSLR), which is
a weakly supervised task dealing with the recognition of continuous signs from
videos, without any prior knowledge about the temporal boundaries between
consecutive signs. Data scarcity heavily impedes the progress of CSLR. Existing
approaches typically train CSLR models on a monolingual corpus, which is orders
of magnitude smaller than that of speech recognition. In this work, we explore
the feasibility of utilizing multilingual sign language corpora to facilitate
monolingual CSLR. Our work is built upon the observation of cross-lingual
signs, which originate from different sign languages but have similar visual
signals (e.g., hand shape and motion). The underlying idea of our approach is
to identify the cross-lingual signs in one sign language and properly leverage
them as auxiliary training data to improve the recognition capability of
another. To achieve this goal, we first build two sign language dictionaries
containing isolated signs that appear in two datasets. Then we identify the
sign-to-sign mappings between two sign languages via a well-optimized isolated
sign language recognition model. Finally, we train a CSLR model on the
combination of the target data with original labels and the auxiliary data with
mapped labels. Experimentally, our approach achieves state-of-the-art
performance on two widely used CSLR datasets: Phoenix-2014 and Phoenix-2014T.
Comment: Accepted by ICCV 2023
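To make the mapping step concrete, here is a minimal sketch (not the authors' code) of identifying cross-lingual signs with an ISLR model used as a feature extractor. The embedding function, the toy dictionaries, and the similarity threshold are all hypothetical stand-ins.

```python
# Sketch: map auxiliary-language glosses to visually similar target-language
# glosses via an ISLR embedding, then reuse the auxiliary data with mapped labels.
import numpy as np

rng = np.random.default_rng(0)

def islr_embed(sign_clip: np.ndarray) -> np.ndarray:
    """Stand-in for a well-optimized ISLR model's clip embedding."""
    return sign_clip.mean(axis=0)  # toy: average per-frame features

# Two monolingual dictionaries: gloss -> representative clip (T x D features).
dict_target = {f"TGT_{i}": rng.normal(size=(16, 64)) for i in range(5)}
dict_aux = {f"AUX_{i}": rng.normal(size=(16, 64)) for i in range(8)}

emb_target = {g: islr_embed(c) for g, c in dict_target.items()}
emb_aux = {g: islr_embed(c) for g, c in dict_aux.items()}

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Keep only confident matches: these are the "cross-lingual signs".
threshold = 0.1  # hypothetical similarity cut-off
mapping = {}
for g_aux, v_aux in emb_aux.items():
    g_tgt, score = max(((g, cosine(v_aux, v)) for g, v in emb_target.items()),
                       key=lambda t: t[1])
    if score >= threshold:
        mapping[g_aux] = g_tgt  # auxiliary gloss -> target gloss

print(mapping)
# A CSLR model would then train on target data with original labels plus
# auxiliary data whose glosses are rewritten through `mapping`.
```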
Conditional DETR V2: Efficient Detection Transformer with Box Queries
In this paper, we are interested in Detection Transformer (DETR), an
end-to-end object detection approach based on a transformer encoder-decoder
architecture without hand-crafted postprocessing, such as NMS. Inspired by
Conditional DETR, an improved DETR with fast training convergence that
introduced box queries (originally called spatial queries) in the internal
decoder layers, we reformulate the object query as a box query: a composition
of the embeddings of the reference point and the
is a composition of the embeddings of the reference point and the
transformation of the box with respect to the reference point. This
reformulation indicates the connection between the object query in DETR and the
anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the
box queries from the image content, further improving the detection quality of
Conditional DETR while retaining its fast training convergence. In addition, we
adopt
the idea of axial self-attention to save the memory cost and accelerate the
encoder. The resulting detector, called Conditional DETR V2, achieves better
results than Conditional DETR, reduces the memory cost, and runs more efficiently.
For example, with the DC-ResNet backbone on COCO, our approach runs markedly
faster than Conditional DETR, saves a large share of the overall memory cost,
and improves the AP score.
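For intuition, the following is a minimal sketch of composing a box query from the embedding of a reference point and a transformation that depends on the box. The sinusoidal embedding and the element-wise width/height modulation are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: a "box query" built from a reference-point embedding modulated by
# the box size, making the anchor-box connection explicit.
import numpy as np

def sine_embed(xy: np.ndarray, dim: int = 128) -> np.ndarray:
    """Sinusoidal positional embedding of a 2-D reference point in [0, 1]^2."""
    freqs = 10000.0 ** (-np.arange(dim // 4) / (dim // 4))
    parts = []
    for coord in xy:  # embed x and y separately, then concatenate
        angles = coord * freqs * 2 * np.pi
        parts += [np.sin(angles), np.cos(angles)]
    return np.concatenate(parts)  # shape: (dim,)

def box_query(ref_point: np.ndarray, box_wh: np.ndarray, dim: int = 128) -> np.ndarray:
    """Compose the query: embed the reference point, then modulate it by a
    transformation derived from the box relative to that point."""
    pos = sine_embed(ref_point, dim)
    # Hypothetical transformation: scale the x/y halves of the embedding by
    # the box width/height (anchor-like conditioning on the box extent).
    scale = np.concatenate([np.full(dim // 2, box_wh[0]),
                            np.full(dim // 2, box_wh[1])])
    return pos * scale

q = box_query(np.array([0.5, 0.3]), np.array([0.2, 0.4]))
print(q.shape)  # (128,)
```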
Boosting Zero-shot Learning via Contrastive Optimization of Attribute Representations
Zero-shot learning (ZSL) aims to recognize classes that do not have samples
in the training set. One representative solution is to directly learn an
embedding function associating visual features with corresponding class
semantics for recognizing new classes. Many methods build upon this solution,
and recent ones are especially keen on extracting rich features from images,
e.g. attribute features. These attribute features are normally extracted within
each individual image; however, the traits that features of the same attribute
share across different images are not emphasized. In this paper, we
propose a new framework to boost ZSL by explicitly learning attribute
prototypes beyond images and contrastively optimizing them with attribute-level
features within images. Besides the novel architecture, two elements are
highlighted for attribute representations: a new prototype generation module is
designed to generate attribute prototypes from attribute semantics; a hard
example-based contrastive optimization scheme is introduced to reinforce
attribute-level features in the embedding space. We explore two alternative
backbones, CNN-based and transformer-based, to build our framework and conduct
experiments on three standard benchmarks: CUB, SUN, and AwA2. Results on these
benchmarks demonstrate that our method improves the state of the art by a
considerable margin. Our code will be available at
https://github.com/dyabel/CoAR-ZSL.git
Comment: Accepted to TNNLS
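As a rough illustration of the contrastive scheme, the toy sketch below generates attribute prototypes from attribute semantics and scores attribute-level image features against them with an InfoNCE-style loss. The linear prototype generator and the exact loss form are assumptions for illustration only.

```python
# Sketch: prototypes from attribute semantics + contrastive loss that pulls
# each attribute-level feature toward its own prototype, away from the rest.
import numpy as np

rng = np.random.default_rng(0)
n_attr, sem_dim, feat_dim = 6, 32, 64

attr_semantics = rng.normal(size=(n_attr, sem_dim))  # e.g. attribute word vectors
W = rng.normal(size=(sem_dim, feat_dim)) * 0.1       # toy prototype generator

def l2norm(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

prototypes = l2norm(attr_semantics @ W)              # (n_attr, feat_dim)

# Attribute-level features extracted from one image (one per attribute).
features = l2norm(rng.normal(size=(n_attr, feat_dim)))

def contrastive_loss(features: np.ndarray, prototypes: np.ndarray,
                     tau: float = 0.1) -> float:
    """InfoNCE: each attribute feature should match its own prototype; all
    other prototypes act as negatives (the hardest ones dominate the loss)."""
    logits = features @ prototypes.T / tau           # (n_attr, n_attr)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

print(f"loss = {contrastive_loss(features, prototypes):.3f}")
```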
RAIN: Your Language Models Can Align Themselves without Finetuning
Large language models (LLMs) often demonstrate inconsistencies with human
preferences. Previous research gathered human preference data and then aligned
the pre-trained models using reinforcement learning or instruction tuning, the
so-called finetuning step. In contrast, aligning frozen LLMs without any extra
data is more appealing. This work explores the potential of the latter setting.
We discover that by integrating self-evaluation and rewind mechanisms,
unaligned LLMs can directly produce responses consistent with human preferences
via self-boosting. We introduce a novel inference method, Rewindable
Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate
their own generation and use the evaluation results to guide backward rewind
and forward generation for AI safety. Notably, RAIN operates without the need
for extra data for model alignment and abstains from any training, gradient
computation, or parameter updates; during the self-evaluation phase, the model
receives guidance on which human preference to align with through a
fixed-template prompt, eliminating the need to modify the initial prompt.
Experimental results evaluated by GPT-4 and humans demonstrate the
effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate
of LLaMA 30B over vanilla inference from 82% to 97%, while maintaining the
helpfulness rate. Under the leading adversarial attack llm-attacks on Vicuna
33B, RAIN establishes a new defense baseline by reducing the attack success
rate from 94% to 19%.
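A minimal sketch of the rewind-and-resample loop follows. The mock generator and self-evaluator stand in for a frozen LLM and its fixed-template self-evaluation prompt; RAIN's actual search over token segments is more elaborate than this toy version.

```python
# Sketch: rewindable auto-regressive inference. Generate a candidate segment,
# self-evaluate it, and rewind (discard and resample) if the score is low.
import random

random.seed(0)

def generate_segment(prefix: list[str]) -> list[str]:
    """Stand-in for sampling a few tokens from a frozen LLM."""
    return [random.choice(["safe", "helpful", "harmful"]) for _ in range(3)]

def self_evaluate(prefix: list[str], segment: list[str]) -> float:
    """Stand-in for querying the same model with a fixed-template prompt
    (e.g. 'Is the following response harmless? ...') and reading a score."""
    return sum(tok != "harmful" for tok in segment) / len(segment)

def rain_decode(max_segments: int = 5, threshold: float = 0.99,
                max_rewinds: int = 10) -> list[str]:
    out: list[str] = []
    for _ in range(max_segments):
        for _ in range(max_rewinds):
            candidate = generate_segment(out)
            if self_evaluate(out, candidate) >= threshold:
                break  # keep this segment and move forward
            # otherwise rewind: drop the candidate and resample
        out.extend(candidate)
    return out

print(rain_decode())
```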
Two-Stream Network for Sign Language Recognition and Translation
Sign languages are visual languages that use manual articulations and non-manual
elements to convey information. For sign language recognition and translation,
the majority of existing approaches directly encode RGB videos into hidden
representations. RGB videos, however, are raw signals with substantial visual
redundancy, leading the encoder to overlook the key information for sign
language understanding. To mitigate this problem and better incorporate domain
knowledge, such as handshape and body movement, we introduce a dual visual
encoder containing two separate streams to model both the raw videos and the
keypoint sequences generated by an off-the-shelf keypoint estimator. To make
the two streams interact with each other, we explore a variety of techniques,
including bidirectional lateral connection, sign pyramid network with auxiliary
supervision, and frame-level self-distillation. The resulting model is called
TwoStream-SLR, which is competent for sign language recognition (SLR).
TwoStream-SLR is extended to a sign language translation (SLT) model,
TwoStream-SLT, by simply attaching an extra translation network.
Experimentally, our TwoStream-SLR and TwoStream-SLT achieve state-of-the-art
performance on SLR and SLT tasks across a series of datasets including
Phoenix-2014, Phoenix-2014T, and CSL-Daily.
Comment: Accepted by NeurIPS 2022
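The dual-encoder idea can be sketched compactly. Below, two toy linear streams exchange information through bidirectional lateral connections; the linear encoders, additive fusion, and all dimensions are illustrative assumptions rather than the paper's architecture.

```python
# Sketch: an RGB stream and a keypoint stream with bidirectional lateral
# connections, so each stream is refined by a projection of the other.
import numpy as np

rng = np.random.default_rng(0)
T, rgb_dim, kpt_dim, hid = 8, 512, 128, 256  # frames, feature sizes

video_feats = rng.normal(size=(T, rgb_dim))   # per-frame RGB features
keypoints = rng.normal(size=(T, kpt_dim))     # off-the-shelf keypoint features

W_v = rng.normal(size=(rgb_dim, hid)) * 0.02  # RGB-stream encoder (toy)
W_k = rng.normal(size=(kpt_dim, hid)) * 0.02  # keypoint-stream encoder (toy)
W_v2k = rng.normal(size=(hid, hid)) * 0.02    # lateral: video -> keypoint
W_k2v = rng.normal(size=(hid, hid)) * 0.02    # lateral: keypoint -> video

h_v, h_k = video_feats @ W_v, keypoints @ W_k
# Bidirectional lateral connection: each stream receives a projected version
# of the other stream's hidden state.
h_v_fused = h_v + h_k @ W_k2v
h_k_fused = h_k + h_v @ W_v2k

# A recognition head (or an attached translation network) would consume the
# concatenated per-frame representations.
frame_repr = np.concatenate([h_v_fused, h_k_fused], axis=-1)
print(frame_repr.shape)  # (8, 512)
```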
Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
Recently, vision-language pre-training shows great potential in
open-vocabulary object detection, where detectors trained on base classes are
devised to detect new classes. The class text embedding is first
generated by feeding prompts to the text encoder of a pre-trained
vision-language model. It is then used as the region classifier to supervise
the training of a detector. The key element behind the success of this model
is a proper prompt, which requires careful word tuning and ingenious design. To
avoid laborious prompt engineering, several prompt representation learning
methods have been proposed for the image classification task; however, they
yield only sub-optimal solutions when applied to the
detection task. In this paper, we introduce a novel method, detection prompt
(DetPro), to learn continuous prompt representations for open-vocabulary object
detection based on the pre-trained vision-language model. Different from the
previous classification-oriented methods, DetPro has two highlights: 1) a
background interpretation scheme that includes proposals from the image
background in prompt training; 2) a context grading scheme that separates
proposals in the image foreground for tailored prompt training. We assemble
DetPro with ViLD, a
recent state-of-the-art open-world object detector, and conduct experiments on
the LVIS dataset, as well as transfer learning on the Pascal VOC, COCO, and
Objects365
datasets. Experimental results show that our DetPro outperforms the baseline
ViLD in all settings, e.g., +3.4 APbox and +3.0 APmask improvements on the
novel classes of LVIS. Code and models are available at
https://github.com/dyabel/detpro.
Comment: Accepted by CVPR 2022
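To illustrate the core mechanism of learned continuous prompts, the sketch below prepends shared learnable context vectors to frozen class-name embeddings, encodes them with a mocked text encoder, and uses the result as a cosine-similarity region classifier. All names and dimensions are hypothetical, and the background interpretation and context grading schemes are omitted.

```python
# Sketch: continuous prompt vectors feeding a frozen text encoder, whose
# outputs serve as the region classifier of an open-vocabulary detector.
import numpy as np

rng = np.random.default_rng(0)
ctx_len, emb_dim, n_classes = 4, 64, 3

context = rng.normal(size=(ctx_len, emb_dim)) * 0.02  # learnable prompt vectors
class_tokens = rng.normal(size=(n_classes, emb_dim))  # frozen class-name embeddings

def text_encode(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen text encoder of a vision-language model."""
    v = tokens.mean(axis=0)
    return v / np.linalg.norm(v)

def class_embeddings(context: np.ndarray) -> np.ndarray:
    """Prepend the shared learned context to each class token, then encode."""
    return np.stack([text_encode(np.vstack([context, tok[None]]))
                     for tok in class_tokens])

def classify_region(region_feat: np.ndarray, context: np.ndarray) -> int:
    """Region classifier: cosine similarity of a proposal's feature against
    the prompt-derived text embeddings (foreground proposals only here)."""
    region_feat = region_feat / np.linalg.norm(region_feat)
    return int(np.argmax(class_embeddings(context) @ region_feat))

print(classify_region(rng.normal(size=emb_dim), context))
```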